First we need to set up the environment and load the packages we will use for this workshop.

library(Seurat): Loads the Seurat package, which is a comprehensive toolkit for single-cell RNA sequencing and spatial transcriptomics data analysis. It provides a wide range of functions for data preprocessing, normalization, clustering, dimensionality reduction, and visualization. Explore documentation here: https://satijalab.org/seurat/

library(ggplot2): Loads the ggplot2 package, a powerful and flexible system for creating static visualizations in R. Explore documentation here: https://ggplot2.tidyverse.org/

library(scCustomize): Loads the scCustomize package, which provides custom functions and themes to enhance the visualization and analysis capabilities of single-cell and spatial transcriptomics data, often in conjunction with Seurat. Explore documentation here: https://samuel-marsh.github.io/scCustomize/

library(readr): Loads readr package for fast and friendly reading of rectangular data, such as CSV files, into R.

library(pheatmap): Loads pheatmap package, which is for creating pretty heatmaps, offering better control over heatmap customization compared to base R.

library(matrixStats): matrixStats provides highly optimized functions for matrix operations, particularly useful for computing row and column summaries.

library(spdep): spdep stands for Spatial Dependence and Spatial Autocorrelation, and it provides functions for spatial data analysis, including spatial weights generation, spatial autocorrelation statistics, and spatial regression.

library(geojsonR) The geojsonR library is used for handling GeoJSON data in R. GeoJSON is a format for encoding a variety of geographic data structures using JavaScript Object Notation (JSON). It is sometimes used as a format for storing cell segmentation boundaries.

library(Seurat)
library(ggplot2)
library(scCustomize)
library(readr)
library(pheatmap)
library(matrixStats)
library(spdep)
library(geojsonR)

Sets the path to the directory containing the Xenium output data - this is the directory where all of the outputs are stored.

data_dir <- "/project/shared/spatial_data_camp/datasets/DATASET2/XENIUM_COLORECTAL_CANCER/"

ReadXenium reads Xenium spatial transcriptomics data from a specified directory using a Seurat wrapper function that supports this data format. Xenium data typically includes expression matrices and spatial coordinates, along with other information about cell centroids and segmentations and coordinates of individual transcripts.

data_dir: The path to the directory containing the Xenium data, set in the previous step. outs = c(“matrix”, “microns”): Specifies the outputs to read from the data directory. matrix refers to summarised cell by gene matrix and microns refers to individual transcript coordinates.

type = c(“centroids”, “segmentations”): Indicates the types of spatial information to include - here, we are reading ib both cell centroid coordinates and cell boundary segmentations.

data <- ReadXenium(data_dir, outs = c("matrix", "microns"), type=c("centroids", "segmentations"))
10X data contains more than one type and is being returned as a list containing matrices of each type.
|--------------------------------------------------|
|==================================================|

This provides us a list of data:

names(data)
[1] "matrix"        "microns"       "centroids"     "segmentations"

Matrix is further split into gene expression matrix and various control probes and codewords. Different platforms and platform versions include different control probes. As this will vary, it’s important to check and understand what the specific controls in your own data are.

Here, negative control probes are probes that are added to the reaction but target non-biological sequences and should not bind any tissue RNA. Negative control codewords are valid codewords, but no probes with that codeword added to the reaction. This effectively tells us how good the transcript calling algorithm is.

names(data$matrix)
[1] "Gene Expression"           "Negative Control Codeword" "Negative Control Probe"   
[4] "Unassigned Codeword"      

Read in additional information about the cells - this gives us pre-calculated information, for example segmented cell or nucleus size for each cell.

cell_meta_data <- read.csv(file.path(data_dir, "cells.csv.gz"))
rownames(cell_meta_data) <- cell_meta_data$cell_id
head(cell_meta_data)

We will start by creating a basic seurat object from the data.

CreateSeuratObject function initializes a Seurat object using the provided gene expression matrix and optional metadata.

counts: The gene expression matrix, which contains the raw count data for each gene in each cell. data$matrix[[“Gene Expression”]]: Specifies the gene expression matrix extracted from the loaded Xenium data. Here, we leave out the control probes for now.

assay: The name of the assay - you can call it anything you like. Here, we go with “XENIUM”.

meta.data: Metadata associated with the cells or spots. Here, we add the cell statistics we read in earlier as cell_meta_data.

By printing the seurat object, we can see that we read in ~ 30,000 cells with measures for 325 genes

seurat <- CreateSeuratObject(counts = data$matrix[["Gene Expression"]],
                                 assay = "XENIUM",
                                 meta.data = cell_meta_data)
seurat
An object of class Seurat 
325 features across 647524 samples within 1 assay 
Active assay: XENIUM (325 features, 0 variable features)
 1 layer present: counts

Adding spatial coordinates to a Seurat object allows for spatially resolved analysis and visualization. This requires creating objects for centroids and segmentations we read in earlier, and then integrating these with the main Seurat object.

CreateFOV: This function creates a field of view (FOV) object that includes spatial information about the centroids, segmentations, and molecule coordinates. An FOV can be the entire slide, or a selected region within a slide - i.e. it does not need to have entries for all the cells in the seurat object.

coords: A list containing the centroids and/or segmentation data. For larger datasets, it can be quicker to only load centroids, as this minimises the amount of data points.

centroids = CreateCentroids(data\(centroids)*: Creates a centroids object from the centroid data in the Xenium dataset. *segmentation = CreateSegmentation(data\)segmentations): Creates a segmentation object from the segmentation data in the Xenium dataset.

type = c(“segmentation”, “centroids”): Specifies the types of spatial data being included, which are segmentation and centroid data.

molecules = data$microns: The spatial coordinates of individual transcripts/molecules in the data. This is optional - for larger datasets, skipping transcript coordinates can be a good idea.

seurat[[“COLON”]] <- coords: Adds the created FOV object to the Seurat object under the new FOV name “COLON”. This can be named (almost) anything - but, avoid using underscores as this can create some unexpected behaviours later.

TIP: LoadXenium() is a wrapper that would load in both cell counts matrix and spatial coordinates in one function, simplifying these steps. However, in situ platforms are evolving at a very fast rate and there are constant changes on how the data is stored, in particular for file formats for cell segmentation and coordinates. Here, we have broken down the steps to show how to assemble an in situ seurat object from the key components, in case the platform specific readers don’t work for your specific data.

coords <- CreateFOV(coords = list(centroids = CreateCentroids(data$centroids), 
                                  segmentation = CreateSegmentation(data$segmentations)),
                    type = c("segmentation", "centroids"),
                    molecules = data$microns,
                    assay = "XENIUM")
seurat[["COLONC2"]] <- coords  

Inspect the object - now, you can see we have added a spatial field of view:

To subset the object

Adding control probes and codewords as separate assays in the Seurat object allows for the tracking and analysis of technical artifacts and noise within your spatial transcriptomics data, while keeping these outputs separate from the main biological gene expression values.

Unassigned codewords are unused codewords. There is no probe in a particular gene panel that will generate the codeword.

Negative control probes are probes that exist in the panels but target non-biological sequences. They can be used to assess the specificity of the assay.

Negative control codewords are codewords in the codebook that do not have any probes matching that code. They are chosen to meet the same requirements as regular codewords and can be used to assess the specificity of the decoding algorithm.

seurat[["Negative.Control.Codeword"]] <- CreateAssayObject(counts = data$matrix[["Negative Control Codeword"]])
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
seurat[["Negative.Control.Probe"]] <- CreateAssayObject(counts = data$matrix[["Negative Control Probe"]])
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
seurat[["Unassigned.Codeword"]] <- CreateAssayObject(counts = data$matrix[["Unassigned Codeword"]])
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')

subset an object

seurat #read the object
An object of class Seurat 
541 features across 647524 samples within 4 assays 
Active assay: XENIUM (325 features, 0 variable features)
 1 layer present: counts
 3 other assays present: Negative.Control.Codeword, Negative.Control.Probe, Unassigned.Codeword
 2 spatial fields of view present: COLONC2 CRC2

Let’s start with some basic QC and visualisation of the data.

In Seurat, in situ spatial transcriptomics counterpart functions to ‘SpatialDimPlot’ and ‘SpatialFeaturePlot’ we covered yesterday are called ‘ImageFeaturePlot’ and ‘ImageDimPlot’. These have additional functionality to plot cell segmentations and individual transcript coordinates, but otherwise function exactly the same as the sequencing based ST counterparts.

First, lets visualise the total transcripts detected per cell.

As in scRNA-Seq data, this is the most basic measure of overall signal and how well the data looks.

Unlike in scRNA-Seq data or unbiased sequencing-based ST, these measures are also very heavily dependent not only on the total RNA quantity of each cell and tissue quality, but also on the target panel used for the experiment. Under-represented cell types will naturally yield fewer transcripts. Finally, the quality of cell segmentation also plays a role.

In this case, we can see that there are areas with higher and lower total transcripts detected.

Understanding your tissue and target panel here is important to delineate where these differences are biological and where they may be technical.

Similarly, we can visualise the total number of gene detected per cell. You can see that this is a bit less variable across tissue.

This can also suggest that there cells at the top of the epithelial crypts in this sample with genes detected at high copy number than the rest of the tissue.

ImageFeaturePlot(seurat_CRC2, "nFeature_XENIUM", axes = T) + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

This code examines the distribution of the number of features (genes) detected per cell in the Seurat object using a density plot and calculates specific quantiles of this distribution. This is important for understanding the variability and distribution of detected features, which can help identify potential issues such as low-quality cells and determine any filtering thresholds that may need to be applied.

If you’re coming from scRNA-Seq work, these low numbers probably look very alarming. How can you possibly work with 31 median genes per cell?

Unlike scRNA-Seq data and sequencing-based ST, both gene dropouts and noise are much, much lower in in situ ST data.

We are also working with 100-fold fewer targetted genes.

quantile(seurat_CRC2$nFeature_XENIUM, c(0.01, 0.1, 0.5, 0.9, 0.99))
 1% 10% 50% 90% 99% 
  5  14  33  56  76 

Using ImageFeaturePlot to visualize the cell area in spatial transcriptomics data allows us to examine the spatial organization and potential heterogeneity of cell sizes within your tissue sample.

Why do we get such a difference in spatial distribution of cell sizes?

This could be due to biological differences between small and large cells - e.g. small cells like T-cells.

However, here the signal correlates with areas of low cellularisation. Therefore, it is likely this is an artefact of nuclei expansion in cell segmentation.

What is Nuclei Expansion?

Nuclei expansion in cell segmentation refers to the process of enlarging the segmented nuclei regions to approximate the boundaries of the entire cells. This technique is used to better represent the actual cell boundaries when only the nuclei have been explicitly segmented/we only have DAPI and no additional cell boundary staining. The primary goal is to provide a more accurate estimation of the cellular area, which is crucial for various downstream analyses in spatial transcriptomics and single-cell studies. In this case, nuclei expansion is constrained either by maximum distance or other nearby cells - so, where there are no other nearby cells to “bump into”, the expansion generates artificially bigger cells.

ImageFeaturePlot(seurat_CRC2, "cell_area", axes = T) + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

We can further check that this is likely the case by plotting the ratio between nuclei and total cell area. We can see that there is a very big decrease in percentage of cell area occupied by nucleus in areas of low cell density.

The cell-to-nucleus area ratio can also potentially provide insights into cell morphology, cell type and potential changes in cellular states or conditions. For example, T-Cells can often be quite well identified by this variable alone, as they have a small cytoplasm volume. However, without a cell boundary stain, this metric mainly captures segmentation artefacts, so be careful about over-interpretation!

ImageFeaturePlot(seurat_CRC2, "cell_nucleus_ratio") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

If we look at the distribution, we see that we have a big tail end of overly large cells.

In this case, we can see that as expected, there is generally a correlation between cell area and transcript detection rate.

However, we also have a group of cells where this is not the case - very large cells but relatively few transcripts. These cells are mainly submucosal stromal cells which are very poorly covered by the panel 10x have used.

We can create a filter to remove the overly large cells from the analysis.

quantile(seurat$cell_area, 0.99): Calculates the 99th percentile of the cell_area values in the Seurat object. This value serves as a threshold to identify the largest 1% of cells - but what is a sensible threshold, if any, depends on your tissue.

seurat\(cell_area < quantile(seurat\)cell_area, 0.99): Compares each cell’s area to the 99th percentile threshold. The result is a logical vector where each element is TRUE if the corresponding cell’s area is less than the 99th percentile and FALSE otherwise.

seurat[[“SIZE_FILTER_LARGE”]]: Creates a new metadata field named SIZE_FILTER_LARGE in the Seurat object, storing the logical vector.

seurat_CRC2[["SIZE_FILTER_LARGE"]] <- seurat_CRC2$cell_area < quantile(seurat_CRC2$cell_area, .99)

Now we can use ImageDimPlot to visualise the cells which have been flagged for removal.

We can see that these are mostly in the submucosa region.

How do different thresholds behave? Is there a more appropriate one to use? Is any necessary at all?

ImageDimPlot(seurat_CRC2, group.by="SIZE_FILTER_LARGE")

We can use the same approach to create a filter for segmented cells which are very small and likely segmentation arfetacts.

quantile(seurat$cell_area, 0.01): Calculates the 1st percentile of the cell_area values in the Seurat object. This value serves as a threshold to identify the smallest 1% of cells.

seurat\(cell_area > quantile(seurat\)cell_area, 0.01): Compares each cell’s area to the 1st percentile threshold. The result is a logical vector where each element is TRUE if the corresponding cell’s area is greater than the 1st percentile and FALSE otherwise.

seurat[[“SIZE_FILTER_SMALL”]]: Creates a new metadata field named SIZE_FILTER_SMALL in the Seurat object, storing the logical vector.

seurat_CRC2[["SIZE_FILTER_SMALL"]] <- seurat_CRC2$cell_area > quantile(seurat_CRC2$cell_area, .01)

Now we can use ImageDimPlot to visualise the cells which have been flagged for removal.

We can see that these are more scattered throughout the tissue - but there may be more in the follicular regions.

How do different thresholds behave? Is there a more appropriate one to use? Is any necessary at all?

We can check how these values correlate with gene detection rate.

If we filter out small cells, we will remove cells with low numbers of genes detected.

If we filter out large cells, this is not that biased towards overly large counts, as we saw before.

Adjusting the threshold for what is considered a “small cell” can have significant implications for your analysis, especially in areas with specific cell types such as T-cells, which are small and densely packed in follicular regions. This example demonstrates how changing the threshold to the 10th percentile affects the filtering. In this case, we would probably filter out a lot of good cells that we don’t want to lose! So, be careful when looking at these types of QC metrics!

seurat_CRC2[["SIZE_FILTER_SMALL"]] <- seurat_CRC2$cell_area > quantile(seurat_CRC2$cell_area, .1)
ImageDimPlot(seurat_CRC2, group.by="SIZE_FILTER_SMALL")

Lets set this back to the original 1% threshold.

seurat_CRC2[["SIZE_FILTER_SMALL"]] <- seurat_CRC2$cell_area > quantile(seurat_CRC2$cell_area, .01)

The most important filter is the overall transcript detection. Empty cells or cells with very low transcript count cannot be taken forward for clustering analysis and it is extremely difficult to identify what they may be. Here, we set a threshold of minimum 15 transcripts. This seems quite low - for data from in situ platforms with low noise (Xenium, Merfish, Merscope), this is generally enough to cluster and identify cell types. If your data has more noise (e.g. CosMx), a higher threshold is more appropriate.

seurat\(nCount_XENIUM >= 15*: Compares each cell's transcript count to the threshold of 15. The result is a logical vector where each element is TRUE if the corresponding cell has at least 15 transcripts and FALSE otherwise. *seurat\)TRANSCRIPT_FILTER: Creates a new metadata field named TRANSCRIPT_FILTER in the Seurat object, storing the logical vector.

seurat_CRC2$TRANSCRIPT_FILTER <- seurat_CRC2$nCount_XENIUM >= 15

And we can visualise the cells that we would lose.

We see that we disproportionately would filter out more cells from some regions than others. As pointed out previously, this is likely due to a combination of gene panel coverage in some regions and very small cells in densely packed regions like follicles.

ImageDimPlot(seurat_CRC2, group.by="TRANSCRIPT_FILTER")

Finally, visualizing the counts of negative control codewords, negative control probes, and unassigned codewords helps identify and understand technical artifacts and background noise in your spatial transcriptomics data.

Here, we can see that all control probes and codewords produce yield very little signal, suggesting our data is good quality!

In some cases, high amount of autoflourescence is the cells/tissue can sometimes generate false positive signal and this should be filtered out.

ImageFeaturePlot(seurat_CRC2, "nCount_Negative.Control.Codeword") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

ImageFeaturePlot(seurat_CRC2, "nCount_Negative.Control.Probe") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

ImageFeaturePlot(seurat_CRC2, "nCount_Unassigned.Codeword") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

Although the negative control signal is low, we can nonetheless create a filter to remove cells which have any, although in this case it is probably unnecessary.

seurat_CRC2$PROBE_FILTER <- seurat_CRC2$nCount_Unassigned.Codeword == 0 &
                       seurat_CRC2$nCount_Negative.Control.Codeword == 0 &
                       seurat_CRC2$nCount_Negative.Control.Probe == 0
ImageDimPlot(seurat_CRC2, group.by="PROBE_FILTER")

Finally, we can subset the seurat object based on any/all of the filters we have created earlier.

By combining probe, size, and transcript filters, you can retain only the cells that meet all quality criteria, reducing the impact of technical artifacts and noise on your analysis.

seurat_CRC2 <- subset(seurat_CRC2, PROBE_FILTER & SIZE_FILTER_LARGE & SIZE_FILTER_SMALL & TRANSCRIPT_FILTER)
Warning: Not validating FOV objectsWarning: Not validating Centroids objectsWarning: Not validating Centroids objectsWarning: Not validating FOV objectsWarning: Not validating Centroids objectsWarning: Not validating FOV objectsWarning: Not validating FOV objectsWarning: Not validating FOV objectsWarning: Not validating Seurat objects

Lets examine the cleaned up object - we have lost a few thousand cells from the analysis.

Data Normalisation

The SCTransform function in Seurat is used for normalizing single-cell RNA-seq and spatial transcriptomics data. This method models the gene expression counts using a regularized negative binomial regression and removes technical noise while preserving biological variability. The clip.range parameter is used to limit the range of the transformed values, which can help stabilize downstream analyses by limiting the influence of extreme values.

seurat_CRC2 <- SCTransform(seurat_CRC2, assay = "XENIUM", clip.range = c(-10, 10))
Running SCTransform on assay: XENIUM
Running SCTransform on layer: counts
vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
Variance stabilizing transformation of count matrix of size 325 by 73985
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 320 genes, 5000 cells
Found 38 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 325 genes
Computing corrected count matrix for 325 genes
Calculating gene attributes
Wall clock passed: Time difference of 21.71334 secs
Determine variable features
Centering data matrix

  |                                                                                       
  |                                                                                 |   0%
  |                                                                                       
  |=================================================================================| 100%
Getting residuals for block 1(of 15) for counts dataset
Getting residuals for block 2(of 15) for counts dataset
Getting residuals for block 3(of 15) for counts dataset
Getting residuals for block 4(of 15) for counts dataset
Getting residuals for block 5(of 15) for counts dataset
Getting residuals for block 6(of 15) for counts dataset
Getting residuals for block 7(of 15) for counts dataset
Getting residuals for block 8(of 15) for counts dataset
Getting residuals for block 9(of 15) for counts dataset
Getting residuals for block 10(of 15) for counts dataset
Getting residuals for block 11(of 15) for counts dataset
Getting residuals for block 12(of 15) for counts dataset
Getting residuals for block 13(of 15) for counts dataset
Getting residuals for block 14(of 15) for counts dataset
Getting residuals for block 15(of 15) for counts dataset
Centering data matrix

  |                                                                                       
  |                                                                                 |   0%
  |                                                                                       
  |=================================================================================| 100%
Finished calculating residuals for counts
Set default assay to SCT

Principal Component Analysis (PCA) is a dimensionality reduction technique used to identify the primary axes of variation in high-dimensional data. In the context of spatial transcriptomics, PCA helps to reduce the complexity of the data while preserving the most important patterns of variation.

TIP: If your target panel is very small, you can skip this step and carry out clustering analysis directly on gene expression. This can sometimes help with achieving better clustering results.

seurat_CRC2 <- RunPCA(seurat_CRC2)
PC_ 1 
Positive:  IGFBP7, THBS1, TIMP3, DPYSL3, CTSB, MAF, IFITM1, CYBB, VCAN, ETS1 
       ANXA1, CXCR4, PLXND1, CLU, APOE, TRAC, RPS4Y1, SERPINA1, IL7R, MS4A7 
       RNASE1, CD79A, SOCS3, FZD7, TRBC2, DEPP1, CD14, CD3E, CCL5, CD2 
Negative:  CD24, SLC12A2, RRM2, HMGB2, TYMS, PPP1R1B, EPHB3, CDCA7, CA2, FERMT1 
       SOX9, STMN1, PCLAF, C1QBP, REG4, AQP1, CMBL, MKI67, TK1, CEACAM5 
       EGFR, IMPDH2, S100P, SMOC2, CREB3L1, GATA2, UBE2C, MUC12, TUBA1A, LGR5 
PC_ 2 
Positive:  CTSB, APOE, CYBB, RNASE1, MS4A7, SERPINA1, CD14, C1QC, C1QA, CD163 
       C1QB, FYB1, CCL4, MAF, CCL5, IL7R, GPR183, CXCR4, CD83, TRBC2 
       CD8A, TNFSF13B, CD2, CD3E, TRAC, GZMA, CTLA4, PLXND1, TIGIT, CD3D 
Negative:  THBS1, IGFBP7, DPYSL3, TIMP3, VCAN, FZD7, ALDH1B1, AQP1, DEPP1, CLU 
       CD24, RUNX1T1, CES1, SELENOM, CDKN2B, IFITM1, CPE, EPHB3, FRZB, HMGB2 
       CKAP4, RRM2, EGFR, MEIS2, TUBA1A, IMPDH2, TYMS, CA2, SLC12A2, C1QBP 
PC_ 3 
Positive:  MS4A1, TRBC2, TRAC, CD2, CD3E, CXCR4, CD8A, CCL5, GZMK, CD79A 
       SPOCK2, IL7R, CTLA4, GZMA, CD3G, ETS1, TIGIT, CD3D, KLRB1, CD6 
       BANK1, CST7, LTB, NKG7, SPIB, CD5, LRMP, ITK, TRBC1, FOXP3 
Negative:  THBS1, CTSB, IGFBP7, RNASE1, APOE, CD14, MS4A7, CYBB, C1QC, SERPINA1 
       TIMP3, PLXND1, C1QA, CD163, C1QB, VCAN, ALDH1B1, FZD7, DPYSL3, TUBA1A 
       CD24, AQP1, CPE, CES1, SLC12A2, DEPP1, RUNX1T1, SOCS3, CA2, CEACAM5 
PC_ 4 
Positive:  CD79A, MS4A1, CLU, SEC11C, BANK1, CXCR4, LRMP, SPIB, FKBP11, TNFRSF17 
       DERL3, CD79B, RGS13, TCL1A, PRDX4, SMIM14, FCRL1, IRF8, SELENOK, PAX5 
       CYBB, CXCR5, CD83, LTB, SELL, FCER2, GPR183, C2orf88, MS4A7, DPYSL3 
Negative:  CCL5, CD8A, GZMA, CD2, CD3E, TRBC2, TRAC, CTLA4, CD3G, CCL4 
       NKG7, CD3D, THBS1, TIGIT, IGFBP7, KLRB1, GZMK, CST7, CD6, SPOCK2 
       IL7R, MAF, TIMP3, GNLY, FOXP3, CD8B, ITK, CD5, CD24, ID2 
PC_ 5 
Positive:  IGFBP7, TIMP3, IFITM1, PLXND1, AQP1, ETS1, CPE, SEC11C, SOCS3, FKBP11 
       CD79A, PRDX4, VCAN, DERL3, FRZB, TNFRSF17, LEF1, SELENOK, LYVE1, ODF2L 
       ROBO1, CDKN2B, CKAP4, GIMAP7, ANXA1, CA2, TUBA1A, AFAP1L2, TNFRSF25, RNASE1 
Negative:  THBS1, CLU, DPYSL3, ALDH1B1, FZD7, CES1, SELENOM, MS4A1, MAOB, MEIS2 
       CYBB, RUNX1T1, CEACAM5, TRAC, TRBC2, SPIB, CD14, CD2, TCL1A, MS4A7 
       MAF, LTB, CD3G, BANK1, CD8A, DEPP1, CTLA4, CD3E, TIGIT, NOVA1 

As before, we can visualise how much variation is captured by each PC.

The ElbowPlot function helps to determine the number of significant PCs to use for downstream analyses. The plot typically shows the amount of variance explained by each PC, and the “elbow” point indicates a natural cutoff.

ElbowPlot(seurat_CRC2, 50)

Plotting the top genes contributing to a specific principal component helps in understanding the biological factors driving the variation captured by that component. This type of plot highlights the genes with the highest loadings, which are the most influential in the principal component analysis.

PC_Plotting(seurat_CRC2, dim_number = 1)

The FeaturePlot function in Seurat is used to visualize the expression of a specific gene across cells in a given dimensionality reduction space (e.g., PCA). This helps to understand how the expression of a gene varies across the principal components.

FeaturePlot(seurat_CRC2, "CEACAM5", reduction = "pca") + scale_color_viridis_c()
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.

We can also examine how various PCs are distributed spatially.

Here, we can see that high PC1 loadings enrich in follicular structures and low PC1 loadings enrich in crypt top cells.

ImageFeaturePlot(seurat_CRC2, "PC_1") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

We can plot the expression of high (or low) loading genes to visualise how this correlates with our dimensionality reduction.

ImageFeaturePlot(seurat_CRC2, "IGFBP7", size=.5) + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

Next, we will use the reduced dimensionality data for clustering and cluster visualisation.

RunUMAP: Perform Uniform Manifold Approximation and Projection (UMAP) to reduce the dimensionality of the data for visualization. The UMAP plot reduces the high-dimensional data to two dimensions, preserving the local and global structure of the data for visualization. Cells that are close together in the UMAP plot are similar in their gene expression profiles. seurat: The Seurat object. dims = 1:20: Specifies the principal components to use for UMAP.

FindNeighbors: Finding nearest neighbors helps to identify cells that are similar based on their PCA scores, which is used for clustering. seurat: The Seurat object. reduction = “pca”: Specifies that the PCA space should be used for finding neighbors. dims = 1:20: Specifies the principal components to use for identifying neighbors.

FindClusters: Clustering identifies distinct groups of cells with similar gene expression patterns. The resolution parameter controls the granularity of the clustering. seurat: The Seurat object. resolution = 0.7: Sets the resolution parameter for clustering. Higher values lead to more clusters, while lower values lead to fewer clusters.

seurat_CRC2 <- FindClusters(seurat_CRC2, resolution = 0.2)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 73985
Number of edges: 2363377

Running Louvain algorithm...
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9173
Number of communities: 6
Elapsed time: 52 seconds

Next lets visualise the clusters - firstly, based on transcriptome embedding.

DimPlot: Creates a scatter plot of cells in a reduced-dimensional space, by default now using UMAP dimensionality reduction. seurat: The Seurat object containing the dimensionality reduction results and cluster assignments. label = TRUE: Adds cluster labels to the plot. repel = TRUE: Repels the labels to avoid overlapping, making the plot clearer.

And now lets plot the clusters in tissue space.

We can see that our clusters have quite nice correspondence to distinct spatial regions.

ImageDimPlot(seurat_CRC2, size=.5)
Warning: No FOV associated with assay 'SCT', using global default FOV

As before, now we can use Seurat differential expression functions to identify marker genes for specific cell clusters.

FindMarkers: Identifies genes that are differentially expressed in a specified cluster compared to all other cells. seurat: The Seurat object containing the gene expression data and cluster identities. ident.1 = “0”: Specifies the cluster of interest for which marker genes are to be identified. In this case, cluster “0”. max.cells.per.ident = 500: Limits the number of cells to be used from each cluster for the differential expression analysis to 500. This can help to speed up the computation.

markers <- FindMarkers(seurat_CRC2, ident.1="0", max.cells.per.ident=500)

We can visualise expression of cluster specific markers using feature plots

ImageFeaturePlot(seurat_CRC2, "CD3E", size=.5) + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

ImageFeaturePlot(seurat_CRC2, "MS4A1", size=.5) + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

ImageFeaturePlot(seurat_CRC2, "CEACAM5", size=.5) + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

ImageFeaturePlot(seurat_CRC2, "KIT", size=.5) + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

Or, as in our sequencing ST tutorial, detect and visualise top markers for every cluster.

markers <- FindAllMarkers(seurat_CRC2, max.cells.per.ident = 500)
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5

scCustomize package provides a convenient helper function, Extract_Top_Markers, to extract the top marker genes for each cluster from the output of FindAllMarkers. This function simplifies the process of identifying and retrieving the most significant marker genes for analysis and visualisation.

In this case, we are extracting the top five markers per cluster.

top
 [1] "DMBT1"    "REG4"     "MLPH"     "FERMT1"   "EGFR"     "THBS1"    "CPE"     
 [8] "DEPP1"    "CES1"     "RUNX1T1"  "APOE"     "C1QB"     "C1QC"     "C1QA"    
[15] "CD14"     "MS4A1"    "CD7"      "CTLA4"    "GZMA"     "TRBC2"    "MUC12"   
[22] "CEACAM5"  "OLFM4"    "CEACAM6"  "CD177"    "DERL3"    "CPA3"     "TNFRSF17"
[29] "SLC18A2"  "MS4A2"   

Clustered_DotPlot function from the scCustomize package provides a convenient and visually appealing way to display expression patterns of top marker genes across clusters using a dot plot. This function not only plots the expression data but also clusters the genes and groups for enhanced visual interpretation. This is an alternative to Seurat DotPlot function.

k = 18: Determines the number of clusters for the hierarchical clustering of genes to enhance visual separation of expression patterns.

We can see that most clusters have unique markers, which suggests the dataset is not over-clustered.

Clustered_DotPlot(seurat_CRC2, features = top, k=18)
[[1]]

[[2]]

Additional Spatial Visualisations

The resolution of in situ datasets is typically very high and so it can be difficult to visualise everything in one plot. Below, we will explore different visualisations that can help unpick and understand the data a bit better.

To better visualise spatial distribution of clusters, sometimes it can be useful to subset only certain groups to reduce crowding. Here, we specifically only visualising two selected clusters.

WhichCells: Identifies cells based on specified criteria. seurat: The Seurat object. expression = seurat_clusters %in% c(0, 5): Logical expression to select cells belonging to clusters 0 and 5.

This works with ImageFeaturePlot too. Try it with some genes!

ImageDimPlot(seurat_CRC2, cells=WhichCells(seurat_CRC2, expression = seurat_clusters %in% c(0, 5)))
Warning: No FOV associated with assay 'SCT', using global default FOV

Sometimes, it can be useful to create additional fields of view of the data - for example, zooms of specific regions. First, let’s look at the coordinate system by plotting the data and turning on the plotting of the axes, which are off by default to create nicer looking plots.

This gives us a rough idea on where in the coordinate system to create any subsets or zooms of the data.

For example, if we want to zoom in on the follicle in the top right corner, we can see that it lies roughly between 4000-5000 and 8000-9000 coordinate regions.

ImageDimPlot(seurat_CRC2, axes = T)
Warning: No FOV associated with assay 'SCT', using global default FOV

So, let’s create a new FOV with these coordinates. For this, we can use the Crop function.

seurat[[“COLON”]]: The spatial assay to be cropped. x = c(4200, 5000): The x-axis range for the crop. y = c(8000, 8800): The y-axis range for the crop. coords = “plot”: Specifies the coordinate system to use (typically “plot” for spatial coordinates).

seurat[[“ROI1”]] <- cropped: Adds the cropped region as a new FOV named “ROI1” in the Seurat object. This could be a more informative name, but avoid using underscores!

cropped <- Crop(seurat[["COLON"]], x = c(4200, 5000), y = c(8000, 8800), coords = "plot")
seurat[["ROI1"]] <- cropped

Now we can limit our visualisations just to this region by specifying the name of the new FOV as an “fov” arguement.

As we are zooming in closer to the tissue, we can also switch from plotting cell centroids (i.e. dots) by default to visualising cell segmentation boundaries. Plotting cell boundary polygons for large FOVs can be quite time consuming, and doesn’t provide much more detail on a fully zoomed-out view.

ImageDimPlot(seurat, fov="ROI1", boundaries="segmentation", border.color = "black" )

We can visualise gene expression or other continous variable on the new FOV as before.

For example, here we have MS4A1/CD20 expression, which is a B-Cell marker. We can see it quite nicely limited to the lymphoid follicle.

ImageFeaturePlot(seurat, "MS4A1", fov="ROI1", boundaries="segmentation" , border.color = "black") + scale_fill_viridis_c()

We can also overlay the coordinates of individual molecules to the plot. For example, here we are added some more T-cell and B-cell specific markers.

This visualisation can be useful because molecules are stored independently of cells and cell boundaries in Seurat. Therefore, if there are regions where cell segmentation is not good, or if cells were filtered out from clustering analysis due to their low quality, the molecules will remain and can still be visualised this way.

For example, here we can see there are a few molecules of CXCR5 detected outside of cellular boundaries.

ImageFeaturePlot(seurat, "MS4A1", fov="ROI1", boundaries="segmentation", molecules=c("CXCR5", "FOXP3"), mols.size = .5, border.color = "black" ) + scale_fill_viridis_c()

Cell Type Identification

You can manually annotate your cell clusters, or you can classify them using a reference single-cell dataset. This process is simpler than for Visium data because our data is at the single-cell level, establishing a one-to-one relationship without the need for spot deconvolution.

However, our transcriptome is more limited here, and some cell types may not be well represented. Additionally, our single-cell reference might be missing some cell types that are not well captured by droplet-based technologies but are present in our tissue data.

In this example, we will use a single-cell reference dataset that we prepared earlier.

We will start by reading in the seurat RDS file.

ref <- readRDS("/project/shared/spatial_data_camp/datasets/SINGLE_CELL_REFERENCES/COLON_HC_5K_CELLS.RDS")

Examine the object:

ref
An object of class Seurat 
33556 features across 5725 samples within 3 assays 
Active assay: RNA (33538 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 2 other assays present: HTO, ADT
 2 dimensional reductions calculated: pca, umap

And plot the pre-computed cell clusters. We can see that here we have quite high level annotation.

We want to evaluate how much structural information is lost in single-cell data when limiting ourselves to the targeted gene set. Accurate cluster prediction is challenging if the current gene set does not adequately identify them. To do this, we will quickly re-embedd the data using only the genes present in our spatial transcriptomics data and keep the original cluster annotations derived from unbiased data.

In this example, we can observe that the limited gene set does a reasonably good job at distinguishing major cell populations. However, it struggles to differentiate between similar cell types, such as myofibroblasts and fibroblasts, as effectively as before.

If we visualise the specificity of the gene panel across our single cell reference clusters, we can see that the panel coverage is mainly concentrated across epithelial cells and T-Cells and other immune cells, with few specific markers expressed by stromal cells.

Next, we can use the standard Seurat integration and cross-classification workflow to transfer single-cell derived labels to our spatial object.

Briefly, the first function identifies anchors between the reference single-cell dataset (ref) and the query spatial dataset (seurat). Anchors are pairs of cells that are considered similar between the datasets. The normalization.method = “SCT” specifies that SCTransform normalization should be used.

The second step transfers the cell type labels from the reference dataset to the query dataset. The anchorset argument specifies the anchors found in the previous step. The refdata = ref$CellType argument specifies the cell type labels from the reference dataset to be transferred. The prediction.assay = TRUE argument indicates that the transferred labels should be stored in a new assay in the query dataset. The weight.reduction = seurat[[“pca”]] argument specifies the dimensionality reduction to be used for weighting the transfer, and dims = 1:30 specifies the number of dimensions to use.

seurat_CRC2 <- TransferData(anchorset = anchors, 
                       refdata = ref$CellType, 
                       prediction.assay = TRUE,
                       weight.reduction = seurat_CRC2[["pca"]], 
                       query = seurat_CRC2, 
                       dims=1:30)
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Predicting cell labels
Warning: Layer counts isn't present in the assay object; returning NULL

Unfortunately, the predicted labels and spatial clusters do not correspond clearly in all cases. This discrepancy is particularly evident in the middle regions of the UMAP, where many cells are predicted as epithelial cells - probably incorrectly!

How to improve this?

Ensure Good Representation of Cell Type Markers in in situ Target Panel Most critically, before undertaking any experiments you want to ensure that there is good representation of all cell types in your target panel - in this case, there is not much to be done as the data has already been generated.

Review and Refine Reference Data: Ensure that the reference single-cell dataset is comprehensive and accurately annotated. If certain cell types are not well represented or annotated in the reference dataset, it can lead to misclassification.

Increase the Number of Dimensions: Increasing the number of dimensions used in the UMAP and PCA steps might capture more variance in the data, leading to better label transfer.

Filter and Preprocess Data: Filtering out low-quality cells or genes and performing additional preprocessing steps can enhance the accuracy of the transfer anchors and, consequently, the label predictions.

Manually Annotate or Correct Predictions: In cases where automatic label transfer is insufficient, consider manually annotating or correcting the predictions for critical regions to ensure accuracy.

As before, we can also visualise the predicted cell labels in tissue space.

ImageDimPlot(seurat_CRC2, group.by = "predicted.id")
Warning: No FOV associated with assay 'NEIGHBOURHOOD100', using global default FOV

In line with non-specific predictions, we can also see that the prediction score across these areas is lower.

Outside of stromal cells, we can also see that prediction probability can be low in cells that embedd “between” clusters, for example between core T-Cells and B-Cells, two populations that should be distinct.

This is often the case where cell segmentation is imperfect and partitions transcripts in such a way that it generates “artificial” doublets by pulling in transcripts from an adjacent cell.

FeaturePlot(seurat, "predicted.id.score")

For example, if we visualise the lineage markers for T-Cells and B-Cells, we can see that they are often “co-expressed” in the same cells when biologically, they should not be.

The FeatureScatter function in Seurat is used to create a scatter plot showing the relationship between the expression levels of two genes across all cells. This visualization helps to identify potential correlations or patterns between the two genes.

ImageDimPlot(seurat_CRC2,  boundaries="segmentation", border.color = "black" )
Warning: No FOV associated with assay 'SCT', using global default FOV

ImageDimPlot(seurat_CRC2)
Warning: No FOV associated with assay 'SCT', using global default FOV

ImageDimPlot(seurat_CRC2, group.by = "predicted.id")
Warning: No FOV associated with assay 'SCT', using global default FOV

Spatial Neighbourhood Analyis

neighbours <- FindNeighbors(as.matrix(coords[, c("x", "y")]), k.param = 20, return.neighbor=TRUE)
Computing nearest neighbors

Computing nearest neighbors

ImageDimPlot(seurat_CRC2)
Warning: No FOV associated with assay 'SCT', using global default FOV

Finding Spatially Correlated Genes

neighbours <- FindNeighbors(as.matrix(coords[, c("x", "y")]), k.param = 50)
Computing nearest neighbor graph
Computing SNN
neighbours <- FindNeighbors(as.matrix(coords[, c("x", "y")]), k.param = 50)
Computing nearest neighbor graph
Computing SNN
mt <- LayerData(seurat_CRC2, layer = "counts", assay = "XENIUM")
sum_mtx <- as.matrix(neighbours$nn %*% t(mt))

We can store the neighbourhood-aggregated values in our Seurat object as a separate assay, which we will call “NEIGHBOURHOOD50”. We then normalise the matrix.

seurat_CRC2 <- NormalizeData(seurat_CRC2, assay = "NEIGHBOURHOOD50")
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|

We can then apply quick correlation calculations to identify spatially correlated features.

modules <- cutree(heatmap$tree_row, 5)
modules
  AKR7A3    ANXA1     APOE    BANK1     C1QA     C1QB    C1QBP     C1QC      CA2     CCL5 
       1        2        3        4        3        3        1        3        1        2 
    CD14    CD163      CD2     CD24     CD3D     CD3E     CD3G      CD5      CD6    CD79A 
       3        3        2        1        2        2        2        2        2        2 
   CD79B     CD8A    CDCA7  CEACAM5  CEACAM6     CES1      CLU     CMBL      CPE  CREB3L1 
       4        2        1        5        5        2        2        1        4        5 
    CST7    CTLA4     CTSB    CXCR4     CYBB    DEPP1    DERL3   DPYSL3     EGFR    EPHB3 
       2        2        3        2        2        2        4        2        1        1 
    ETS1   FERMT1   FKBP11    FOXP3     FRZB     FYB1     FZD7   GALNT5    GATA2   GIMAP7 
       2        1        4        2        2        2        2        5        1        2 
  GPR183     GZMA     GZMK    HMGB2   IFITM1   IGFBP7   IL17RB     IL7R   IMPDH2      ITK 
       2        2        2        1        2        2        1        2        1        2 
   KLRB1  KRTCAP3     LGR5     LRMP      LTB      MAF    MKI67     MLPH    MS4A1    MS4A7 
       2        1        1        2        2        2        1        1        4        3 
   MYH14     NKG7      PBK    PCLAF    PLPP2   PLXND1  PPP1R1B    PTTG1     REG4   RNASE1 
       5        4        1        1        1        2        1        1        1        3 
   RNF43    ROBO1     RORA   RPS4Y1     RRM2  RUNX1T1    S100P   SEC11C  SELENOM SERPINA1 
       1        2        2        2        1        2        1        4        2        3 
   SFXN1  SLC12A2    SMOC2    SOCS3     SOX9   SPOCK2    STMN1    THBS1    TIGIT    TIMP3 
       1        1        1        2        1        2        1        2        2        2 
     TK1      TKT  TNFAIP3 TNFRSF17 TNFSF13B     TRAC    TRBC1    TRBC2     TYMS    UBE2C 
       1        1        2        2        2        2        2        2        1        1 
    VCAN 
       2 

Lets visualize some of the detected spatially co-localizing genes. For example, module 2 genes - we can see that CEACAM6 and AQP8 are spatially similar, but not necessarily always expressed by the same cells.

ImageFeaturePlot(seurat_CRC2, "CD24") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'NEIGHBOURHOOD100', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

ImageFeaturePlot(seurat_CRC2, "ANXA1") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'NEIGHBOURHOOD100', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

seurat_CRC2 <- AddModuleScore(seurat_CRC2, features=split(names(modules), modules), assay = "SCT", nbin=3, name = "MOD" )

Visualising module scores - we can see that we have identified a group of genes co-localising at the base of the epithelial crypts (MOD1) and another module of genes co-localising in lymphoid follicles.

ImageFeaturePlot(seurat_CRC2, "MOD1") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

ImageFeaturePlot(seurat_CRC2, "MOD2") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

ImageFeaturePlot(seurat_CRC2, "MOD3") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

ImageFeaturePlot(seurat_CRC2, "MOD4") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

ImageFeaturePlot(seurat_CRC2, "MOD5") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

Detecting Cellular Niches

neighbours <- FindNeighbors(as.matrix(coords[, c("x", "y")]), k.param = 100)
Computing nearest neighbor graph
Computing SNN
diag(neighbours$nn) <- 0 # dont count transcriptome of the cell itself, just neighbours
mt <- LayerData(seurat_CRC2, layer = "counts", assay = "XENIUM")
sum_mtx <- as.matrix(neighbours$nn %*% t(mt))

How is this useful? Well, now you can cluster cells not on their gene expression values, but gene expression values of surrounding cells. This effectively partitions cells not based on their identity, but on their micro-environment! Using this approach, you can identify tissue niches

Alternative approaches - you could count cell types rather than gene expression values, but that requires you to have finalised cell annotation for your dataset, which is not ideal. So, we do unbiased transcriptomics approach.

How would you run this with cell types?

seurat_CRC2[["NEIGHBOURHOOD100"]] <- CreateAssayObject(t(sum_mtx))
DefaultAssay(seurat_CRC2) <- "NEIGHBOURHOOD100"
seurat_CRC2 <- NormalizeData(seurat_CRC2)
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
seurat_CRC2 <- ScaleData(seurat_CRC2, features = rownames(seurat_CRC2))
Centering and scaling data matrix

  |                                                                                        
  |                                                                                  |   0%
  |                                                                                        
  |==================================================================================| 100%
seurat_CRC2 <- RunPCA(seurat_CRC2, features = rownames(seurat_CRC2))
PC_ 1 
Positive:  C1QBP, HMGB2, CMBL, PPP1R1B, TYMS, SLC12A2, STMN1, KRTCAP3, CDCA7, FERMT1 
       CD24, TK1, PCLAF, CA2, IMPDH2, PLPP2, MLPH, EGFR, EPHB3, SOX9 
       RRM2, S100P, SFXN1, UBE2C, REG4, MKI67, RNF43, IL17RB, PBK, SMOC2 
Negative:  RPS4Y1, DPYSL3, SELENOM, ETS1, MAF, RORA, TRAC, ROBO1, GIMAP7, ANXA1 
       SPOCK2, CD79A, RUNX1T1, CD3E, CD2, TRBC2, IFITM1, DEPP1, THBS1, CD3G 
       IL7R, FYB1, VCAN, IGFBP7, CTLA4, GPR183, CD3D, TNFAIP3, GZMK, CXCR4 
PC_ 2 
Positive:  APOE, CEACAM5, C1QB, CTSB, CCL4, CEACAM6, SERPINA1, C1QA, C1QC, CYBB 
       RNASE1, MS4A7, FABP2, CD14, CD83, CEACAM1, CCL5, GZMA, TNFSF13B, CD163 
       MYH14, NKG7, CD8A, FYB1, IL1B, COL17A1, CDK6, RETNLB, RHOV, TBC1D4 
Negative:  MEIS2, NOVA1, ALDH1B1, CLU, AQP1, AGTR1, ID2, GNA11, FZD7, SLC6A8 
       IMPDH2, CES1, PDE4C, SEC11C, ADH1C, LYVE1, MAOB, EGFR, WFDC2, RGMB 
       CTSG, KIT, ANK2, PRDX4, ROBO2, EBPL, TKT, SELENOK, DERL3, PRPH 
PC_ 3 
Positive:  ALDH1B1, GNA11, C1QC, RNASE1, MEIS2, CKAP4, C1QB, CD163, CTSB, C1QA 
       PROX1, RGMB, TUBA1A, CD14, CES1, AGTR1, SLC6A8, MAOB, CPE, MS4A7 
       KIT, FZD7, THBS1, WFDC2, ETV1, MS4A2, CPA3, RUNX1T1, TIMP3, PLXND1 
Negative:  PAX5, CXCR5, FCRL1, MS4A1, SPIB, TCL1A, BANK1, CHI3L2, FCER2, FCRLA 
       CCR7, IRF8, SELL, CD40LG, CD83, TRAT1, VPREB3, CD79B, SMIM14, LTB 
       IER5, COL19A1, GZMK, MKI67, LGALS2, CXCR4, BATF, PTTG1, DNASE1L3, CD6 
PC_ 4 
Positive:  INSM1, ASCL2, TIMP3, IGFBP7, EPHB3, GATA2, THBS1, TUBB, VCAN, MUC12 
       PRDX4, LEF1, CDCA7, ETV1, CPE, PCLAF, FRZB, REG4, HES6, MAOB 
       RRM2, SCG2, LGR5, IFITM1, FERMT1, CD24, CA2, ROBO1, DPYSL3, AFAP1L2 
Negative:  CES2, SDCBP2, SMIM14, COL17A1, SLPI, CDHR5, HHLA2, SLC6A8, DMBT1, HDC 
       FABP2, ANXA13, TFF1, AREG, GPRC5C, PDE4C, BCAS1, RHOV, SULT1B1, GNA11 
       RNASE1, GALNT8, CEACAM6, MYH14, ODF2L, MEIS2, CFTR, UGT2B17, PDZK1IP1, DUOX2 
PC_ 5 
Positive:  PRDX4, LGALS2, CTSB, RNASE1, ID2, CA2, GNLY, CCL4, IL1B, NKG7 
       CD163, CES2, SLPI, KLRC2, C1QC, SERPINA1, TUBA1A, AKR7A3, PLCE1, AQP1 
       GPRIN3, GNA11, CDHR5, CXCL3, GZMA, EBPL, C1QA, CCL5, MLPH, PSTPIP2 
Negative:  FOXA3, MAOB, LEFTY1, PROX1, MUC12, RAB26, CDK6, HEPACAM2, RETNLB, KLK1 
       WFDC2, SMIM14, OLFM4, ATOH1, HES6, RGMB, INSM1, CEACAM6, CRYBA2, CHGB 
       SCNN1A, MB, CEACAM1, ASCL2, MS4A8, FCER2, RFX6, CHGA, CEACAM5, TNFRSF25 
seurat_CRC2 <- FindNeighbors(seurat_CRC2, reduction = "pca", dims = 1:10)
Computing nearest neighbor graph
Computing SNN
seurat_CRC2 <- FindClusters(seurat_CRC2, resolution = 0.1, cluster.name = "Niches")
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 73985
Number of edges: 1833699

Running Louvain algorithm...
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9640
Number of communities: 9
Elapsed time: 10 seconds

Lets visualise the detected “niches”. We can see that we have achieved a coarse partioning of the cells into crypt top, mid-crypt and crypt-base regions, as well as segmenting out follicles and sub-mucosal stroma.

How would you tweak the above approach to generate more or less granular niches?

ImageDimPlot(seurat_CRC2, group.by = "Niches")
Warning: No FOV associated with assay 'NEIGHBOURHOOD100', using global default FOV

We can tabulate our detected niches with predicted cell type labels (or clusters) to visualise enrichment of different cell types across spatial niches.

For example, as could be expected, T-Cells and B-Cells enrich in Niche 2 (follicular).

saveRDS(seurat_CRC2, "seurat_CRC2_annot.RDS")

change the color

ImageDimPlot(seurat_CRC2, cols = cell_colours, group.by = "predicted.id")
Warning: No FOV associated with assay 'NEIGHBOURHOOD100', using global default FOV

merged <- readRDS("CRC_merge")
Warning: cannot open compressed file 'CRC_merge', probable reason 'No such file or directory'Error in gzfile(file, "rb") : cannot open the connection
---
title: "Human Colon Xenium in situ ST Dataset, Nuclei Segmentation"
output: html_notebook
---


First we need to set up the environment and load the packages we will use for this workshop. 

*library(Seurat)*: Loads the Seurat package, which is a comprehensive toolkit for single-cell RNA sequencing and spatial transcriptomics data analysis. It provides a wide range of functions for data preprocessing, normalization, clustering, dimensionality reduction, and visualization. Explore documentation here: https://satijalab.org/seurat/

*library(ggplot2)*: Loads the ggplot2 package, a powerful and flexible system for creating static visualizations in R. Explore documentation here: https://ggplot2.tidyverse.org/

*library(scCustomize)*: Loads the scCustomize package, which provides custom functions and themes to enhance the visualization and analysis capabilities of single-cell and spatial transcriptomics data, often in conjunction with Seurat. Explore documentation here: https://samuel-marsh.github.io/scCustomize/

*library(readr)*: Loads readr package for fast and friendly reading of rectangular data, such as CSV files, into R.

*library(pheatmap)*: Loads pheatmap package, which is for creating pretty heatmaps, offering better control over heatmap customization compared to base R.

*library(matrixStats)*: matrixStats provides highly optimized functions for matrix operations, particularly useful for computing row and column summaries. 

*library(spdep)*: spdep stands for Spatial Dependence and Spatial Autocorrelation, and it provides functions for spatial data analysis, including spatial weights generation, spatial autocorrelation statistics, and spatial regression.

*library(geojsonR)* The geojsonR library is used for handling GeoJSON data in R. GeoJSON is a format for encoding a variety of geographic data structures using JavaScript Object Notation (JSON). It is sometimes used as a format for storing cell segmentation boundaries.

```{r}
library(Seurat)
library(ggplot2)
library(scCustomize)
library(readr)
library(pheatmap)
library(matrixStats)
library(spdep)
library(geojsonR)
```
Sets the path to the directory containing the Xenium output data - this is the directory where all of the outputs are stored.
```{r}
data_dir <- "/project/shared/spatial_data_camp/datasets/DATASET2/XENIUM_COLORECTAL_CANCER/"
```

*ReadXenium* reads Xenium spatial transcriptomics data from a specified directory using a Seurat wrapper function that supports this data format. Xenium data typically includes expression matrices and spatial coordinates, along with other  information about cell centroids and segmentations and coordinates of individual transcripts. 

*data_dir*: The path to the directory containing the Xenium data, set in the previous step.
*outs = c("matrix", "microns")*: Specifies the outputs to read from the data directory. matrix refers to summarised cell by gene matrix and microns refers to individual transcript coordinates.

*type = c("centroids", "segmentations")*: Indicates the types of spatial information to include - here, we are reading ib both cell centroid coordinates and cell boundary segmentations.


```{r}
data <- ReadXenium(data_dir, outs = c("matrix", "microns"), type=c("centroids", "segmentations"))
```
This provides us a list of data:
```{r}
names(data)
```
Matrix is further split into gene expression matrix and various control probes and codewords. Different platforms and platform versions include different control probes. As this will vary, it's important to check and understand what the specific controls in your own data are.  

Here, negative control probes are probes that are added to the reaction but target non-biological sequences and should not bind any tissue RNA. Negative control codewords are valid codewords, but no probes with that codeword added to the reaction. This effectively tells us how good the transcript calling algorithm is.

```{r}
names(data$matrix)
```
Read in additional information about the cells - this gives us pre-calculated information, for example segmented cell or nucleus size for each cell.
```{r}
cell_meta_data <- read.csv(file.path(data_dir, "cells.csv.gz"))
rownames(cell_meta_data) <- cell_meta_data$cell_id
head(cell_meta_data)
```

We will start by creating a basic seurat object from the data. 

*CreateSeuratObject* function initializes a Seurat object using the provided gene expression matrix and optional metadata.

*counts*: The gene expression matrix, which contains the raw count data for each gene in each cell.
*data$matrix[["Gene Expression"]]*: Specifies the gene expression matrix extracted from the loaded Xenium data. Here, we leave out the control probes for now. 

*assay*: The name of the assay - you can call it anything you like. Here, we go with "XENIUM". 

*meta.data*: Metadata associated with the cells or spots. Here, we add the cell statistics we read in earlier as *cell_meta_data*.

By printing the *seurat* object, we can see that we read in ~ 30,000 cells with measures for 325 genes

```{r}
seurat <- CreateSeuratObject(counts = data$matrix[["Gene Expression"]],
                                 assay = "XENIUM",
                                 meta.data = cell_meta_data)
seurat
```

Adding spatial coordinates to a Seurat object allows for spatially resolved analysis and visualization. This requires creating objects for centroids and segmentations we read in earlier, and then integrating these with the main Seurat object.

*CreateFOV*: This function creates a field of view (FOV) object that includes spatial information about the centroids, segmentations, and molecule coordinates. An FOV can be the entire slide, or a selected region within a slide - i.e. it does not need to have entries for all the cells in the seurat object.

*coords*: A list containing the centroids and/or segmentation data. For larger datasets, it can be quicker to only load centroids, as this minimises the amount of data points. 

*centroids = CreateCentroids(data$centroids)*: Creates a centroids object from the centroid data in the Xenium dataset.
*segmentation = CreateSegmentation(data$segmentations)*: Creates a segmentation object from the segmentation data in the Xenium dataset.

*type = c("segmentation", "centroids")*: Specifies the types of spatial data being included, which are segmentation and centroid data.

*molecules = data$microns*: The spatial coordinates of individual transcripts/molecules in the data. This is optional - for larger datasets, skipping transcript coordinates can be a good idea.

*seurat[["COLON"]] <- coords*: Adds the created FOV object to the Seurat object under the new FOV name "COLON". This can be named (almost) anything - but, avoid using underscores as this can create some unexpected behaviours later.

TIP: *LoadXenium()* is a wrapper that would load in both cell counts matrix and spatial coordinates in one function, simplifying these steps. However, *in situ* platforms are evolving at a very fast rate and there are constant changes on how the data is stored, in particular for file formats for cell segmentation and coordinates. Here, we have broken down the steps to show how to assemble an in situ seurat object from the key components, in case the platform specific readers don't work for your specific data.
```{r}
coords <- CreateFOV(coords = list(centroids = CreateCentroids(data$centroids), 
                                  segmentation = CreateSegmentation(data$segmentations)),
                    type = c("segmentation", "centroids"),
                    molecules = data$microns,
                    assay = "XENIUM")
seurat[["COLONC2"]] <- coords  
```

Inspect the object - now, you can see we have added a spatial field of view:

To subset the object
```{r}
cropped <- Crop(seurat[["COLONC2"]], x = c(1000, 3000), y = c(3000, 6000), coords = "plot")
seurat[["CRC2"]] <- cropped
ImageDimPlot(seurat, fov= "CRC2",axes = T)
```
Adding control probes and codewords as separate assays in the Seurat object allows for the tracking and analysis of technical artifacts and noise within your spatial transcriptomics data, while keeping these outputs separate from the main biological gene expression values.


**Unassigned codewords** are unused codewords. There is no probe in a particular gene panel that will generate the codeword.

**Negative control probes** are probes that exist in the panels but target non-biological sequences. They can be used to assess the specificity of the assay.

**Negative control codewords** are codewords in the codebook that do not have any probes matching that code. They are chosen to meet the same requirements as regular codewords and can be used to assess the specificity of the decoding algorithm.


```{r}
seurat[["Negative.Control.Codeword"]] <- CreateAssayObject(counts = data$matrix[["Negative Control Codeword"]])
seurat[["Negative.Control.Probe"]] <- CreateAssayObject(counts = data$matrix[["Negative Control Probe"]])
seurat[["Unassigned.Codeword"]] <- CreateAssayObject(counts = data$matrix[["Unassigned Codeword"]])
```

subset an object
```{r}
# x = c(1000, 3000), y = c(3000, 6000)
seurat_CRC2 <- seurat[, seurat$x_centroid >= 1000 & seurat$x_centroid <= 3000 & seurat$y_centroid >= 3000 & seurat$y_centroid <= 6000]
seurat_CRC2 #read the object
seurat #read the object
saveRDS(seurat_CRC2, file="CRC2_subset.RDS")
```
Let's start with some basic QC and visualisation of the data. 

In Seurat, *in situ* spatial transcriptomics counterpart functions to *'SpatialDimPlot'* and *'SpatialFeaturePlot'* we covered yesterday are called *'ImageFeaturePlot'* and *'ImageDimPlot'*. These have additional functionality to plot cell segmentations and individual transcript coordinates, but otherwise function exactly the same as the sequencing based ST counterparts. 

First, lets visualise the total transcripts detected per cell.

As in scRNA-Seq data, this is the most basic measure of overall signal and how well the data looks. 

Unlike in scRNA-Seq data or unbiased sequencing-based ST, these measures are also very heavily dependent not only on the total RNA quantity of each cell and tissue quality, but also on the target panel used for the experiment. Under-represented cell types will naturally yield fewer transcripts.
Finally, the quality of cell segmentation also plays a role.

In this case, we can see that there are areas with higher and lower total transcripts detected. 

Understanding your tissue and target panel here is important to delineate where these differences are biological and where they may be technical.

```{r}
#x = c(1000, 3000), y = c(3000, 6000),
```
Similarly, we can visualise the total number of gene detected per cell. You can see that this is a bit less variable across tissue. 

This can also suggest that there cells at the top of the epithelial crypts in this sample with genes detected at high copy number than the rest of the tissue.

```{r}
ImageFeaturePlot(seurat_CRC2, "nFeature_XENIUM", axes = T) + scale_fill_viridis_c()
```

This code examines the distribution of the number of features (genes) detected per cell in the Seurat object using a density plot and calculates specific quantiles of this distribution. This is important for understanding the variability and distribution of detected features, which can help identify potential issues such as low-quality cells and determine any filtering thresholds that may need to be applied.

If you're coming from scRNA-Seq work, these low numbers probably look very alarming. How can you possibly work with 31 median genes per cell?

Unlike scRNA-Seq data and sequencing-based ST, both gene dropouts and noise are much, much lower in *in situ* ST data. 

We are also working with 100-fold fewer targetted genes.


```{r}
ggplot(seurat_CRC2[[]], aes(nFeature_XENIUM)) + geom_density()
quantile(seurat_CRC2$nFeature_XENIUM, c(0.01, 0.1, 0.5, 0.9, 0.99))
```
Using *ImageFeaturePlot* to visualize the cell area in spatial transcriptomics data allows us to examine the spatial organization and potential heterogeneity of cell sizes within your tissue sample.

**Why do we get such a difference in spatial distribution of cell sizes?**

This could be due to biological differences between small and large cells - e.g. small cells like T-cells. 

However, here the signal correlates with areas of low cellularisation. Therefore, it is likely this is an artefact of nuclei expansion in cell segmentation. 

What is Nuclei Expansion?

Nuclei expansion in cell segmentation refers to the process of enlarging the segmented nuclei regions to approximate the boundaries of the entire cells. This technique is used to better represent the actual cell boundaries when only the nuclei have been explicitly segmented/we only have DAPI and no additional cell boundary staining. The primary goal is to provide a more accurate estimation of the cellular area, which is crucial for various downstream analyses in spatial transcriptomics and single-cell studies. In this case, nuclei expansion is constrained either by maximum distance or other nearby cells - so, where there are no other nearby cells to "bump into", the expansion generates artificially bigger cells.


```{r}
ImageFeaturePlot(seurat_CRC2, "cell_area", axes = T) + scale_fill_viridis_c()
```
We can further check that this is likely the case by plotting the ratio between nuclei and total cell area. We can see that there is a very big decrease in percentage of cell area occupied by nucleus in areas of low cell density.

The cell-to-nucleus area ratio can also potentially provide insights into cell morphology, cell type and potential changes in cellular states or conditions. For example, T-Cells can often be quite well identified by this variable alone, as they have a small cytoplasm volume.  However, without a cell boundary stain, this metric mainly captures segmentation artefacts, so be careful about over-interpretation!

```{r}
seurat_CRC2$cell_nucleus_ratio <- seurat_CRC2$nucleus_area / seurat_CRC2$cell_area
ImageFeaturePlot(seurat_CRC2, "cell_nucleus_ratio") + scale_fill_viridis_c()
```

If we look at the distribution, we see that we have a big tail end of overly large cells.

```{r}
ggplot(seurat_CRC2[[]], aes(cell_area)) + geom_density()
```
In this case, we can see that as expected, there is generally a correlation between cell area and transcript detection rate. 

However, we also have a group of cells where this is not the case - very large cells but relatively few transcripts. These cells are mainly submucosal stromal cells which are very poorly covered by the panel 10x have used. 


```{r}
ggplot(seurat_CRC2[[]], aes(nCount_XENIUM, cell_area)) + geom_point() 
```
We can create a filter to remove the overly large cells from the analysis.

*quantile(seurat$cell_area, 0.99)*: Calculates the 99th percentile of the cell_area values in the Seurat object. This value serves as a threshold to identify the largest 1% of cells - but what is a sensible threshold, if any, depends on your tissue.

*seurat$cell_area < quantile(seurat$cell_area, 0.99)*: Compares each cell's area to the 99th percentile threshold. The result is a logical vector where each element is TRUE if the corresponding cell's area is less than the 99th percentile and FALSE otherwise.


*seurat[["SIZE_FILTER_LARGE"]]*: Creates a new metadata field named SIZE_FILTER_LARGE in the Seurat object, storing the logical vector.


```{r}
seurat_CRC2[["SIZE_FILTER_LARGE"]] <- seurat_CRC2$cell_area < quantile(seurat_CRC2$cell_area, .99)
```

Now we can use *ImageDimPlot* to visualise the cells which have been flagged for removal.

We can see that these are mostly in the submucosa region. 

**How do different thresholds behave? Is there a more appropriate one to use? Is any necessary at all?**

```{r}
ImageDimPlot(seurat_CRC2, group.by="SIZE_FILTER_LARGE")
```
We can use the same approach to create a filter for segmented cells which are very small and likely segmentation arfetacts. 

*quantile(seurat$cell_area, 0.01)*: Calculates the 1st percentile of the cell_area values in the Seurat object. This value serves as a threshold to identify the smallest 1% of cells.

*seurat$cell_area > quantile(seurat$cell_area, 0.01)*: Compares each cell's area to the 1st percentile threshold. The result is a logical vector where each element is TRUE if the corresponding cell's area is greater than the 1st percentile and FALSE otherwise.

*seurat[["SIZE_FILTER_SMALL"]]*: Creates a new metadata field named SIZE_FILTER_SMALL in the Seurat object, storing the logical vector.


```{r}
seurat_CRC2[["SIZE_FILTER_SMALL"]] <- seurat_CRC2$cell_area > quantile(seurat_CRC2$cell_area, .01)
```

Now we can use *ImageDimPlot* to visualise the cells which have been flagged for removal.

We can see that these are more scattered throughout the tissue - but there may be more in the follicular regions. 

**How do different thresholds behave? Is there a more appropriate one to use? Is any necessary at all?**

```{r}
ImageDimPlot(seurat_CRC2, group.by="SIZE_FILTER_SMALL")
```
We can check how these values correlate with gene detection rate. 

If we filter out small cells, we will remove cells with low numbers of genes detected. 

If we filter out large cells, this is not that biased towards overly large counts, as we saw before.


```{r fig.height=10, fig.width=7}
p1 <- VlnPlot(seurat_CRC2, "nFeature_XENIUM", group.by = "SIZE_FILTER_SMALL", pt.size = .1, alpha = .5) + labs(title="Small Cell Filter")
p2 <- VlnPlot(seurat_CRC2, "nFeature_XENIUM", group.by = "SIZE_FILTER_LARGE", pt.size = .1, alpha = .5)+ labs(title="Large Cell Filter")

p1 + p2
```
Adjusting the threshold for what is considered a "small cell" can have significant implications for your analysis, especially in areas with specific cell types such as T-cells, which are small and densely packed in follicular regions. This example demonstrates how changing the threshold to the 10th percentile affects the filtering. In this case, we would probably filter out a lot of good cells that we don't want to lose! So, be careful when looking at these types of QC metrics!


```{r}
seurat_CRC2[["SIZE_FILTER_SMALL"]] <- seurat_CRC2$cell_area > quantile(seurat_CRC2$cell_area, .1)
```

```{r}
ImageDimPlot(seurat_CRC2, group.by="SIZE_FILTER_SMALL")
```
Lets set this back to the original 1% threshold.
```{r}
seurat_CRC2[["SIZE_FILTER_SMALL"]] <- seurat_CRC2$cell_area > quantile(seurat_CRC2$cell_area, .01)
```


The most important filter is the overall transcript detection. Empty cells or cells with very low transcript count cannot be taken forward for clustering analysis and it is extremely difficult to identify what they may be. Here, we set a threshold of minimum 15 transcripts. This seems quite low - for data from *in situ* platforms with low noise (Xenium, Merfish, Merscope), this is generally enough to cluster and identify cell types. If your data has more noise (e.g. CosMx), a higher threshold is more appropriate.


*seurat$nCount_XENIUM >= 15*: Compares each cell's transcript count to the threshold of 15. The result is a logical vector where each element is TRUE if the corresponding cell has at least 15 transcripts and FALSE otherwise.
*seurat$TRANSCRIPT_FILTER*: Creates a new metadata field named TRANSCRIPT_FILTER in the Seurat object, storing the logical vector.


```{r}
seurat_CRC2$TRANSCRIPT_FILTER <- seurat_CRC2$nCount_XENIUM >= 15
```

And we can visualise the cells that we would lose. 

We see that we disproportionately would filter out more cells from some regions than others. As pointed out previously, this is likely due to a combination of gene panel coverage in some regions and very small cells in densely packed regions like follicles.

```{r}
ImageDimPlot(seurat_CRC2, group.by="TRANSCRIPT_FILTER")
```
Finally, visualizing the counts of negative control codewords, negative control probes, and unassigned codewords helps identify and understand technical artifacts and background noise in your spatial transcriptomics data.

Here, we can see that all control probes and codewords produce yield very little signal, suggesting our data is good quality! 

In some cases, high amount of autoflourescence is the cells/tissue can sometimes generate false positive signal and this should be filtered out. 

```{r fig.height=7, fig.width=7}
ImageFeaturePlot(seurat_CRC2, "nCount_Negative.Control.Codeword") + scale_fill_viridis_c()
ImageFeaturePlot(seurat_CRC2, "nCount_Negative.Control.Probe") + scale_fill_viridis_c()
ImageFeaturePlot(seurat_CRC2, "nCount_Unassigned.Codeword") + scale_fill_viridis_c()
```

Although the negative control signal is low, we can nonetheless create a filter to remove cells which have any, although in this case it is probably unnecessary.

```{r}
seurat_CRC2$PROBE_FILTER <- seurat_CRC2$nCount_Unassigned.Codeword == 0 &
                       seurat_CRC2$nCount_Negative.Control.Codeword == 0 &
                       seurat_CRC2$nCount_Negative.Control.Probe == 0
```

```{r}
ImageDimPlot(seurat_CRC2, group.by="PROBE_FILTER")
```
Finally, we can subset the seurat object based on any/all of the filters we have created earlier. 

By combining probe, size, and transcript filters, you can retain only the cells that meet all quality criteria, reducing the impact of technical artifacts and noise on your analysis.

```{r}
seurat_CRC2 <- subset(seurat_CRC2, PROBE_FILTER & SIZE_FILTER_LARGE & SIZE_FILTER_SMALL & TRANSCRIPT_FILTER)
```
Lets examine the cleaned up object - we have lost a few thousand cells from the analysis. 
```{r}
seurat
ImageDimPlot(seurat_CRC2)
saveRDS(seurat_CRC2, file="CRC2_subset_filtered.RDS")
```
**Data Normalisation**

The *SCTransform* function in Seurat is used for normalizing single-cell RNA-seq and spatial transcriptomics data. This method models the gene expression counts using a regularized negative binomial regression and removes technical noise while preserving biological variability. The *clip.range* parameter is used to limit the range of the transformed values, which can help stabilize downstream analyses by limiting the influence of extreme values. 


```{r}
seurat_CRC2 <- SCTransform(seurat_CRC2, assay = "XENIUM", clip.range = c(-10, 10))
```

Principal Component Analysis (PCA) is a dimensionality reduction technique used to identify the primary axes of variation in high-dimensional data. In the context of spatial transcriptomics, PCA helps to reduce the complexity of the data while preserving the most important patterns of variation. 


TIP: If your target panel is very small, you can skip this step and carry out clustering analysis directly on gene expression. This can sometimes help with achieving better clustering results.
```{r}
seurat_CRC2 <- RunPCA(seurat_CRC2)
```
As before, we can visualise how much variation is captured by each PC. 

The ElbowPlot function helps to determine the number of significant PCs to use for downstream analyses. The plot typically shows the amount of variance explained by each PC, and the "elbow" point indicates a natural cutoff.


```{r}
ElbowPlot(seurat_CRC2, 50)
```
Plotting the top genes contributing to a specific principal component helps in understanding the biological factors driving the variation captured by that component. This type of plot highlights the genes with the highest loadings, which are the most influential in the principal component analysis.

```{r fig.height=9, fig.width=7}
PC_Plotting(seurat_CRC2, dim_number = 1)
```

The *FeaturePlot* function in Seurat is used to visualize the expression of a specific gene across cells in a given dimensionality reduction space (e.g., PCA). This helps to understand how the expression of a gene varies across the principal components.

```{r}
FeaturePlot(seurat_CRC2, "CEACAM5", reduction = "pca") + scale_color_viridis_c()
```
We can also examine how various PCs are distributed spatially. 

Here, we can see that high PC1 loadings enrich in follicular structures and low PC1 loadings enrich in crypt top cells.
```{r}
ImageFeaturePlot(seurat_CRC2, "PC_1") + scale_fill_viridis_c()
```

We can plot the expression of high (or low) loading genes to visualise how this correlates with our dimensionality reduction.
```{r}
ImageFeaturePlot(seurat_CRC2, "IGFBP7", size=.5) + scale_fill_viridis_c()
```
Next, we will use the reduced dimensionality data for clustering and cluster visualisation. 

*RunUMAP*: Perform Uniform Manifold Approximation and Projection (UMAP) to reduce the dimensionality of the data for visualization. The UMAP plot reduces the high-dimensional data to two dimensions, preserving the local and global structure of the data for visualization. Cells that are close together in the UMAP plot are similar in their gene expression profiles.
*seurat*: The Seurat object.
*dims = 1:20*: Specifies the principal components to use for UMAP.

*FindNeighbors*: Finding nearest neighbors helps to identify cells that are similar based on their PCA scores, which is used for clustering.
*seurat*: The Seurat object.
*reduction = "pca"*: Specifies that the PCA space should be used for finding neighbors.
*dims = 1:20*: Specifies the principal components to use for identifying neighbors.

*FindClusters*: Clustering identifies distinct groups of cells with similar gene expression patterns. The resolution parameter controls the granularity of the clustering.
*seurat*: The Seurat object.
*resolution = 0.7*: Sets the resolution parameter for clustering. Higher values lead to more clusters, while lower values lead to fewer clusters.

```{r}
seurat_CRC2 <- RunUMAP(seurat_CRC2, dims = 1:20)
seurat_CRC2 <- FindNeighbors(seurat_CRC2, reduction = "pca", dims = 1:20)
seurat_CRC2 <- FindClusters(seurat_CRC2, resolution = 0.2)
seurat_CRC2 <- FindClusters(seurat_CRC2, resolution = 0.4)
seurat_CRC2 <- FindClusters(seurat_CRC2, resolution = 0.6)
seurat_CRC2 <- FindClusters(seurat_CRC2, resolution = 0.8)
seurat_CRC2 <- FindClusters(seurat_CRC2, resolution = 1.0)
seurat_CRC2 <- FindClusters(seurat_CRC2, resolution = 0.3)
clustree(seurat_CRC2)
library(clustree)
```

Next lets visualise the clusters - firstly, based on transcriptome embedding.

*DimPlot*: Creates a scatter plot of cells in a reduced-dimensional space, by default now using UMAP dimensionality reduction.
*seurat*: The Seurat object containing the dimensionality reduction results and cluster assignments.
*label = TRUE*: Adds cluster labels to the plot.
*repel = TRUE*: Repels the labels to avoid overlapping, making the plot clearer.


```{r}
DimPlot(seurat_CRC2, label=T, repel=T)
seurat_CRC2$SCT_snn_res.0.2
```
And now lets plot the clusters in tissue space. 

We can see that our clusters have quite nice correspondence to distinct spatial regions.
```{r}

ImageDimPlot(seurat_CRC2, size=.5)
```
As before, now we can use Seurat differential expression functions to identify marker genes for specific cell clusters.

*FindMarkers*: Identifies genes that are differentially expressed in a specified cluster compared to all other cells.
*seurat*: The Seurat object containing the gene expression data and cluster identities.
*ident.1 = "0"*: Specifies the cluster of interest for which marker genes are to be identified. In this case, cluster "0".
*max.cells.per.ident = 500*: Limits the number of cells to be used from each cluster for the differential expression analysis to 500. This can help to speed up the computation.


```{r}
markers <- FindMarkers(seurat_CRC2, ident.1="0", max.cells.per.ident=500)
```

```{r}
head(markers)
```

We can visualise expression of cluster specific markers using feature plots
```{r}
FeaturePlot(seurat_CRC2, "CD3E", label=T, repel=T)+ scale_color_viridis_c(direction=-1) #t cell
FeaturePlot(seurat_CRC2, "MS4A1", label=T, repel=T)+  scale_color_viridis_c(direction=-1) #B cell
FeaturePlot(seurat_CRC2, "CEACAM5", label=T, repel=T)+ scale_color_viridis_c(direction=-1) #colorectal cancer and non-small-cell lung cancer
FeaturePlot(seurat_CRC2, "KIT", label=T, repel=T)+ scale_color_viridis_c(direction=-1)
#he KIT gene is a cell surface marker and predictive biomarker that can be used for a variety of purposes, including: 
#ancer diagnosis
#KIT gene mutations can be used to diagnose, predict, and provide prognostic information for certain cancers, such as acute myeloid leukemia (AML), melanoma, gastrointestinal stromal tumors (GIST), and systemic mastocytosis (SM)

ImageFeaturePlot(seurat_CRC2, "CD3E", size=.5) + scale_fill_viridis_c()
ImageFeaturePlot(seurat_CRC2, "MS4A1", size=.5) + scale_fill_viridis_c()
ImageFeaturePlot(seurat_CRC2, "CEACAM5", size=.5) + scale_fill_viridis_c()
ImageFeaturePlot(seurat_CRC2, "KIT", size=.5) + scale_fill_viridis_c()
```
Or, as in our sequencing ST tutorial, detect and visualise top markers for every cluster.
```{r}
markers <- FindAllMarkers(seurat_CRC2, max.cells.per.ident = 500)
```

```{r}
head(markers)
```

scCustomize package provides a convenient helper function, *Extract_Top_Markers*, to extract the top marker genes for each cluster from the output of *FindAllMarkers*. This function simplifies the process of identifying and retrieving the most significant marker genes for analysis and visualisation.

In this case, we are extracting the top five markers per cluster.

```{r}
top <- Extract_Top_Markers(markers, num_genes = 5, named_vector = FALSE, make_unique = TRUE)
top
```

*Clustered_DotPlot* function from the *scCustomize* package provides a convenient and visually appealing way to display expression patterns of top marker genes across clusters using a dot plot. This function not only plots the expression data but also clusters the genes and groups for enhanced visual interpretation. This is an alternative to Seurat *DotPlot* function. 

*k = 18*: Determines the number of clusters for the hierarchical clustering of genes to enhance visual separation of expression patterns. 

We can see that most clusters have unique markers, which suggests the dataset is not over-clustered.

```{r fig.height=10, fig.width=7}
Clustered_DotPlot(seurat_CRC2, features = top, k=18)
```

**Additional Spatial Visualisations**

The resolution of *in situ* datasets is typically very high and so it can be difficult to visualise everything in one plot. Below, we will explore different visualisations that can help unpick and understand the data a bit better. 


To better visualise spatial distribution of clusters, sometimes it can be useful to subset only certain groups to reduce crowding.  Here, we specifically only visualising two selected clusters. 

*WhichCells*: Identifies cells based on specified criteria.
*seurat*: The Seurat object.
*expression = seurat_clusters %in% c(0, 5)*: Logical expression to select cells belonging to clusters 0 and 5.


**This works with *ImageFeaturePlot* too. Try it with some genes!**
```{r}
ImageDimPlot(seurat_CRC2, cells=WhichCells(seurat_CRC2, expression = seurat_clusters %in% c(0, 5)))
```

Sometimes, it can be useful to create additional fields of view of the data - for example, zooms of specific regions. 
First, let's look at the coordinate system by plotting the data and turning on the plotting of the axes, which are off by default to create nicer looking plots. 

This gives us a rough idea on where in the coordinate system to create any subsets or zooms of the data.

For example, if we want to zoom in on the follicle in the top right corner, we can see that it lies roughly between 4000-5000 and 8000-9000 coordinate regions. 

```{r}
ImageDimPlot(seurat_CRC2, axes = T)
```
So, let's create a new FOV with these coordinates. For this, we can use the *Crop* function. 

*seurat[["COLON"]]*: The spatial assay to be cropped.
*x = c(4200, 5000)*: The x-axis range for the crop.
*y = c(8000, 8800)*: The y-axis range for the crop.
*coords = "plot"*: Specifies the coordinate system to use (typically "plot" for spatial coordinates).

*seurat[["ROI1"]] <- cropped*: Adds the cropped region as a new FOV named "ROI1" in the Seurat object. This could be a more informative name, but avoid using underscores!

```{r}
cropped <- Crop(seurat[["COLON"]], x = c(4200, 5000), y = c(8000, 8800), coords = "plot")
seurat[["ROI1"]] <- cropped
```
Now we can limit our visualisations just to this region by specifying the name of the new FOV as an "fov" arguement. 

As we are zooming in closer to the tissue, we can also switch from plotting cell centroids (i.e. dots) by default to visualising cell segmentation boundaries. Plotting cell boundary polygons for large FOVs can be quite time consuming, and doesn't provide much more detail on a fully zoomed-out view. 


```{r fig.height=8, fig.width=8}
ImageDimPlot(seurat, fov="ROI1", boundaries="segmentation", border.color = "black" )
```
We can visualise gene expression or other continous variable on the new FOV as before.

For example, here we have MS4A1/CD20 expression, which is a B-Cell marker. We can see it quite nicely limited to the lymphoid follicle. 
```{r}
ImageFeaturePlot(seurat, "MS4A1", fov="ROI1", boundaries="segmentation" , border.color = "black") + scale_fill_viridis_c()
```

We can also overlay the coordinates of individual molecules to the plot. For example, here we are added some more T-cell and B-cell specific markers. 

This visualisation can be useful because molecules are stored independently of cells and cell boundaries in Seurat. Therefore, if there are regions where cell segmentation is not good, or if cells were filtered out from clustering analysis due to their low quality, the molecules will remain and can still be visualised this way.

For example, here we can see there are a few molecules of CXCR5 detected outside of cellular boundaries. 

```{r}
ImageFeaturePlot(seurat, "MS4A1", fov="ROI1", boundaries="segmentation", molecules=c("CXCR5", "FOXP3"), mols.size = .5, border.color = "black" ) + scale_fill_viridis_c()
```
**Cell Type Identification**

You can manually annotate your cell clusters, or you can classify them using a reference single-cell dataset. This process is simpler than for Visium data because our data is at the single-cell level, establishing a one-to-one relationship without the need for spot deconvolution.

However, our transcriptome is more limited here, and some cell types may not be well represented. Additionally, our single-cell reference might be missing some cell types that are not well captured by droplet-based technologies but are present in our tissue data.

In this example, we will use a single-cell reference dataset that we prepared earlier.

We will start by reading in the seurat RDS file.
```{r}
ref <- readRDS("/project/shared/spatial_data_camp/datasets/SINGLE_CELL_REFERENCES/COLON_HC_5K_CELLS.RDS")
```

Examine the object:
```{r}
ref
```
And plot the pre-computed cell clusters. We can see that here we have quite high level annotation. 
```{r}
DimPlot(ref)
```
We want to evaluate how much structural information is lost in single-cell data when limiting ourselves to the targeted gene set. Accurate cluster prediction is challenging if the current gene set does not adequately identify them. To do this, we will quickly re-embedd the data using only the genes present in our spatial transcriptomics data and keep the original cluster annotations derived from unbiased data.

In this example, we can observe that the limited gene set does a reasonably good job at distinguishing major cell populations. However, it struggles to differentiate between similar cell types, such as myofibroblasts and fibroblasts, as effectively as before.

```{r}
ref <- SCTransform(ref, residual.features =rownames(seurat_CRC2))
ref <- RunPCA(ref)
ref <- RunUMAP(ref, dims=1:20)
DimPlot(ref, label=T, repel=T)
```
If we visualise the specificity of the gene panel across our single cell reference clusters, we can see that the panel coverage is mainly concentrated across epithelial cells and T-Cells and other immune cells, with few specific markers expressed by stromal cells. 
```{r}
ps <- AggregateExpression(ref, features = rownames(seurat), normalization.method = "LogNormalize", assays="RNA", return.seurat = T)
ps <- ScaleData(ps, features=rownames(ps))
pheatmap(LayerData(ps, layer="scale.data"), show_rownames = F)
```


Next, we can use the standard Seurat integration and cross-classification workflow to transfer single-cell derived labels to our spatial object.

Briefly, the first function identifies anchors between the reference single-cell dataset (ref) and the query spatial dataset (seurat). Anchors are pairs of cells that are considered similar between the datasets. The *normalization.method = "SCT"* specifies that *SCTransform* normalization should be used.

The second step transfers the cell type labels from the reference dataset to the query dataset. The anchorset argument specifies the anchors found in the previous step. The *refdata = ref$CellType* argument specifies the cell type labels from the reference dataset to be transferred. The *prediction.assay = TRUE* argument indicates that the transferred labels should be stored in a new assay in the query dataset. The *weight.reduction = seurat[["pca"]]* argument specifies the dimensionality reduction to be used for weighting the transfer, and *dims = 1:30* specifies the number of dimensions to use.


```{r}
anchors <- FindTransferAnchors(reference = ref, 
                               query = seurat_CRC2, 
                               normalization.method = "SCT")

seurat_CRC2 <- TransferData(anchorset = anchors, 
                       refdata = ref$CellType, 
                       prediction.assay = TRUE,
                       weight.reduction = seurat_CRC2[["pca"]], 
                       query = seurat_CRC2, 
                       dims=1:30)

```

Unfortunately, the predicted labels and spatial clusters do not correspond clearly in all cases. This discrepancy is particularly evident in the middle regions of the UMAP, where many cells are predicted as epithelial cells - probably incorrectly!

How to improve this?

**Ensure Good Representation of Cell Type Markers in *in situ* Target Panel**
Most critically, before undertaking any experiments you want to ensure that there is good representation of all cell types in your target panel - in this case, there is not much to be done as the data has already been generated. 

**Review and Refine Reference Data:**
Ensure that the reference single-cell dataset is comprehensive and accurately annotated. If certain cell types are not well represented or annotated in the reference dataset, it can lead to misclassification.

**Increase the Number of Dimensions:**
Increasing the number of dimensions used in the UMAP and PCA steps might capture more variance in the data, leading to better label transfer.

**Filter and Preprocess Data:**
Filtering out low-quality cells or genes and performing additional preprocessing steps can enhance the accuracy of the transfer anchors and, consequently, the label predictions. 

**Manually Annotate or Correct Predictions:**
In cases where automatic label transfer is insufficient, consider manually annotating or correcting the predictions for critical regions to ensure accuracy.


```{r}
DimPlot(seurat_CRC2, group.by = "predicted.id",label = T)
FeaturePlot(seurat_CRC2, "predicted.id.score")
```
As before, we can also visualise the predicted cell labels in tissue space.
```{r}
ImageDimPlot(seurat_CRC2, group.by = "predicted.id")
```
In line with non-specific predictions, we can also see that the prediction score across these areas is lower. 

Outside of stromal cells, we can also see that prediction probability can be low in cells that embedd "between" clusters, for example between core T-Cells and B-Cells, two populations that should be distinct. 

This is often the case where cell segmentation is imperfect and partitions transcripts in such a way that it generates "artificial" doublets by pulling in transcripts from an adjacent cell. 
```{r}
FeaturePlot(seurat, "predicted.id.score")
```
For example, if we visualise the lineage markers for T-Cells and B-Cells, we can see that they are often "co-expressed" in the same cells when biologically, they should not be. 

The *FeatureScatter* function in Seurat is used to create a scatter plot showing the relationship between the expression levels of two genes across all cells. This visualization helps to identify potential correlations or patterns between the two genes.


```{r fig.height=5, fig.width=10}
FeatureScatter(seurat_CRC2, "MS4A1", "CD3D", jitter=T)
FeaturePlot(seurat_CRC2, c("MS4A1", "CD3D"))
```

```{r fig.height=8, fig.width=8}
ImageDimPlot(seurat_CRC2,  boundaries="segmentation", border.color = "black" )
```


```{r}
ImageDimPlot(seurat_CRC2)
```

```{r}
ImageDimPlot(seurat_CRC2, group.by = "predicted.id")
```
**Spatial Neighbourhood Analyis**

```{r}
coords <- GetTissueCoordinates(seurat_CRC2, which = "centroids")
rownames(coords) <- coords$cell
neighbours <- FindNeighbors(as.matrix(coords[, c("x", "y")]), k.param = 20, return.neighbor=TRUE)

```
Computing nearest neighbors
```{r}
cells <- WhichCells(seurat_CRC2, expression= SCT_snn_res.0.2 == 3)
adjacent <- TopNeighbors(neighbours, cells, n = 10)

Idents(seurat_CRC2) <- "Other Cells"
seurat_CRC2 <- SetIdent(seurat_CRC2, cells = adjacent, "Adjacent Cells")
seurat_CRC2 <- SetIdent(seurat_CRC2, cells = cells, "Cells of Interest")

ImageDimPlot(seurat_CRC2)

seurat_CRC2[["group1"]] <- Idents(seurat_CRC2)
```
**Finding Spatially Correlated Genes**
```{r}
neighbours <- FindNeighbors(as.matrix(coords[, c("x", "y")]), k.param = 50)
mt <- LayerData(seurat_CRC2, layer = "counts", assay = "XENIUM")
sum_mtx <- as.matrix(neighbours$nn %*% t(mt))

```

We can store the neighbourhood-aggregated values in our Seurat object as a separate assay, which we will call "NEIGHBOURHOOD50". We then normalise the matrix. 
```{r}
seurat_CRC2[["NEIGHBOURHOOD50"]] <- CreateAssayObject(t(sum_mtx))
seurat_CRC2 <- NormalizeData(seurat_CRC2, assay = "NEIGHBOURHOOD50")

```
We can then apply quick correlation calculations to identify spatially correlated features. 


```{r}
corrgenes <- cor(as.matrix(t(LayerData(seurat_CRC2, assay = "NEIGHBOURHOOD50", layer = "data"))))
diag(corrgenes) <- 0
high_corr_genes <- which(rowMaxs(corrgenes) > .7)
diag(corrgenes) <- 1
heatmap <- pheatmap(corrgenes[high_corr_genes, high_corr_genes], border_color = NA)
```
```{r}
modules <- cutree(heatmap$tree_row, 5)
modules
```
Lets visualize some of the detected spatially co-localizing genes. For example,  module 2 genes - we can see that CEACAM6 and AQP8 are spatially similar, but not necessarily always expressed by the same cells.
```{r}
ImageFeaturePlot(seurat_CRC2, "AKR7A3") + scale_fill_viridis_c()
ImageFeaturePlot(seurat_CRC2, "C1QBP") + scale_fill_viridis_c()
ImageFeaturePlot(seurat_CRC2, "CD24") + scale_fill_viridis_c()
ImageFeaturePlot(seurat_CRC2, "ANXA1") + scale_fill_viridis_c()
```

```{r}
seurat_CRC2 <- AddModuleScore(seurat_CRC2, features=split(names(modules), modules), assay = "SCT", nbin=3, name = "MOD" )
```
Visualising module scores - we can see that we have identified a group of genes co-localising at the base of the epithelial crypts (MOD1) and another module of genes co-localising in lymphoid follicles.
```{r}
ImageFeaturePlot(seurat_CRC2, "MOD1") + scale_fill_viridis_c()
ImageFeaturePlot(seurat_CRC2, "MOD2") + scale_fill_viridis_c()
ImageFeaturePlot(seurat_CRC2, "MOD3") + scale_fill_viridis_c()
ImageFeaturePlot(seurat_CRC2, "MOD4") + scale_fill_viridis_c()
ImageFeaturePlot(seurat_CRC2, "MOD5") + scale_fill_viridis_c()
```
**Detecting Cellular Niches**
```{r}
neighbours <- FindNeighbors(as.matrix(coords[, c("x", "y")]), k.param = 100)
diag(neighbours$nn) <- 0 # dont count transcriptome of the cell itself, just neighbours
mt <- LayerData(seurat_CRC2, layer = "counts", assay = "XENIUM")
sum_mtx <- as.matrix(neighbours$nn %*% t(mt))
```

How is this useful? Well, now you can cluster cells not on their gene expression values, but gene expression values of surrounding cells. This effectively partitions cells not based on their identity, but on their micro-environment!
Using this approach, you can identify tissue niches

Alternative approaches - you could count cell types rather than gene expression values, but that requires you to have finalised cell annotation for your dataset, which is not ideal. So, we do unbiased transcriptomics approach. 

**How would you run this with cell types?**
```{r}
seurat_CRC2[["NEIGHBOURHOOD100"]] <- CreateAssayObject(t(sum_mtx))
DefaultAssay(seurat_CRC2) <- "NEIGHBOURHOOD100"
seurat_CRC2 <- NormalizeData(seurat_CRC2)
seurat_CRC2 <- ScaleData(seurat_CRC2, features = rownames(seurat_CRC2))
seurat_CRC2 <- RunPCA(seurat_CRC2, features = rownames(seurat_CRC2))
seurat_CRC2 <- FindNeighbors(seurat_CRC2, reduction = "pca", dims = 1:10)
seurat_CRC2 <- FindClusters(seurat_CRC2, resolution = 0.1, cluster.name = "Niches")

```
Lets visualise the detected "niches". We can see that we have achieved a coarse partioning of the cells into crypt top, mid-crypt and crypt-base regions, as well as segmenting out follicles and sub-mucosal stroma.

**How would you tweak the above approach to generate more or less granular niches?**
```{r}
ImageDimPlot(seurat_CRC2, group.by = "Niches")
```
We can tabulate our detected niches with predicted cell type labels (or clusters) to visualise enrichment of different cell types across spatial niches. 

For example, as could be expected, T-Cells and B-Cells enrich in Niche 2 (follicular).

```{r}
comp <- table(seurat_CRC2$Niches, seurat_CRC2$predicted.id)
pheatmap(comp, scale="row")
```

```{r}
comp <- table(seurat_CRC2$Niches, seurat_CRC2$SCT_snn_res.0.6)
pheatmap(comp, scale="row")
```
```{r}
saveRDS(seurat_CRC2, "seurat_CRC2_annot.RDS")
```


change the color
```{r}
library(RColorBrewer)
library(scales)
cell_colours <- c("#F8766D", "#DB8E00", "#AEA200", "#64B200", "#00BD5C", "#00C1A7", 
                  "#00BADE", "#00A6FF", "#B385FF", "#EF67EB", "#FF63B6")
names(cell_colours)  <- c("Epithelium", "Fibroblasts", "T-Cells",  "Myofibroblasts", "Macrophages", "Glia", "Endothelium", "Telocytes", "Plasma", "B-Cells", "Pericytes")
names(cell_colours) <- seurat_CRC2$predicted.id %>% unique()
ImageDimPlot(seurat_CRC2, cols = cell_colours, group.by = "predicted.id")
ImageDimPlot(seurat_CRC2, group.by = "predicted.id")
```

```{r}
merged <- readRDS("CRC_merge.rds")

CellpropPlot(seurat_CRC2, group.by= "predicited.id", prop.in="")
```







